## [1] 113937 81
The dataset consists of 81 variables and almost 114,000 observations (113,937 rows).
## [1] "AmountDelinquent"
## [2] "AvailableBankcardCredit"
## [3] "BankcardUtilization"
## [4] "BorrowerAPR"
## [5] "BorrowerRate"
## [6] "BorrowerState"
## [7] "ClosedDate"
## [8] "CreditGrade"
## [9] "CreditScoreRangeLower"
## [10] "CreditScoreRangeUpper"
## [11] "CurrentCreditLines"
## [12] "CurrentDelinquencies"
## [13] "CurrentlyInGroup"
## [14] "DateCreditPulled"
## [15] "DebtToIncomeRatio"
## [16] "DelinquenciesLast7Years"
## [17] "EmploymentStatus"
## [18] "EmploymentStatusDuration"
## [19] "EstimatedEffectiveYield"
## [20] "EstimatedLoss"
## [21] "EstimatedReturn"
## [22] "FirstRecordedCreditLine"
## [23] "GroupKey"
## [24] "IncomeRange"
## [25] "IncomeVerifiable"
## [26] "InquiriesLast6Months"
## [27] "InvestmentFromFriendsAmount"
## [28] "InvestmentFromFriendsCount"
## [29] "Investors"
## [30] "IsBorrowerHomeowner"
## [31] "LenderYield"
## [32] "ListingCategory..numeric."
## [33] "ListingCreationDate"
## [34] "ListingKey"
## [35] "ListingNumber"
## [36] "LoanCurrentDaysDelinquent"
## [37] "LoanFirstDefaultedCycleNumber"
## [38] "LoanKey"
## [39] "LoanMonthsSinceOrigination"
## [40] "LoanNumber"
## [41] "LoanOriginalAmount"
## [42] "LoanOriginationDate"
## [43] "LoanOriginationQuarter"
## [44] "LoanStatus"
## [45] "LP_CollectionFees"
## [46] "LP_CustomerPayments"
## [47] "LP_CustomerPrincipalPayments"
## [48] "LP_GrossPrincipalLoss"
## [49] "LP_InterestandFees"
## [50] "LP_NetPrincipalLoss"
## [51] "LP_NonPrincipalRecoverypayments"
## [52] "LP_ServiceFees"
## [53] "MemberKey"
## [54] "MonthlyLoanPayment"
## [55] "Occupation"
## [56] "OnTimeProsperPayments"
## [57] "OpenCreditLines"
## [58] "OpenRevolvingAccounts"
## [59] "OpenRevolvingMonthlyPayment"
## [60] "PercentFunded"
## [61] "ProsperPaymentsLessThanOneMonthLate"
## [62] "ProsperPaymentsOneMonthPlusLate"
## [63] "ProsperPrincipalBorrowed"
## [64] "ProsperPrincipalOutstanding"
## [65] "ProsperRating..Alpha."
## [66] "ProsperRating..numeric."
## [67] "ProsperScore"
## [68] "PublicRecordsLast10Years"
## [69] "PublicRecordsLast12Months"
## [70] "Recommendations"
## [71] "RevolvingCreditBalance"
## [72] "ScorexChangeAtTimeOfListing"
## [73] "StatedMonthlyIncome"
## [74] "Term"
## [75] "TotalCreditLinespast7years"
## [76] "TotalInquiries"
## [77] "TotalProsperLoans"
## [78] "TotalProsperPaymentsBilled"
## [79] "TotalTrades"
## [80] "TradesNeverDelinquent..percentage."
## [81] "TradesOpenedLast6Months"
## 'data.frame': 113937 obs. of 81 variables:
## $ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ ClosedDate : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ CurrentlyInGroup : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
## $ GroupKey : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
## $ DateCreditPulled : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent..percentage. : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
## $ MemberKey : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
## ListingKey ListingNumber
## 17A93590655669644DB4C06: 6 Min. : 4
## 349D3587495831350F0F648: 4 1st Qu.: 400919
## 47C1359638497431975670B: 4 Median : 600554
## 8474358854651984137201C: 4 Mean : 627886
## DE8535960513435199406CE: 4 3rd Qu.: 892634
## 04C13599434217079754AEE: 3 Max. :1255725
## (Other) :113912
## ListingCreationDate CreditGrade Term
## 2013-10-02 17:20:16.550000000: 6 :84984 Min. :12.00
## 2013-08-28 20:31:41.107000000: 4 C : 5649 1st Qu.:36.00
## 2013-09-08 09:27:44.853000000: 4 D : 5153 Median :36.00
## 2013-12-06 05:43:13.830000000: 4 B : 4389 Mean :40.83
## 2013-12-06 11:44:58.283000000: 4 AA : 3509 3rd Qu.:36.00
## 2013-08-21 07:25:22.360000000: 3 HR : 3508 Max. :60.00
## (Other) :113912 (Other): 6745
## LoanStatus ClosedDate
## Current :56576 :58848
## Completed :38074 2014-03-04 00:00:00: 105
## Chargedoff :11992 2014-02-19 00:00:00: 100
## Defaulted : 5018 2014-02-11 00:00:00: 92
## Past Due (1-15 days) : 806 2012-10-30 00:00:00: 81
## Past Due (31-60 days): 363 2013-02-26 00:00:00: 78
## (Other) : 1108 (Other) :54633
## BorrowerAPR BorrowerRate LenderYield
## Min. :0.00653 Min. :0.0000 Min. :-0.0100
## 1st Qu.:0.15629 1st Qu.:0.1340 1st Qu.: 0.1242
## Median :0.20976 Median :0.1840 Median : 0.1730
## Mean :0.21883 Mean :0.1928 Mean : 0.1827
## 3rd Qu.:0.28381 3rd Qu.:0.2500 3rd Qu.: 0.2400
## Max. :0.51229 Max. :0.4975 Max. : 0.4925
## NA's :25
## EstimatedEffectiveYield EstimatedLoss EstimatedReturn
## Min. :-0.183 Min. :0.005 Min. :-0.183
## 1st Qu.: 0.116 1st Qu.:0.042 1st Qu.: 0.074
## Median : 0.162 Median :0.072 Median : 0.092
## Mean : 0.169 Mean :0.080 Mean : 0.096
## 3rd Qu.: 0.224 3rd Qu.:0.112 3rd Qu.: 0.117
## Max. : 0.320 Max. :0.366 Max. : 0.284
## NA's :29084 NA's :29084 NA's :29084
## ProsperRating..numeric. ProsperRating..Alpha. ProsperScore
## Min. :1.000 :29084 Min. : 1.00
## 1st Qu.:3.000 C :18345 1st Qu.: 4.00
## Median :4.000 B :15581 Median : 6.00
## Mean :4.072 A :14551 Mean : 5.95
## 3rd Qu.:5.000 D :14274 3rd Qu.: 8.00
## Max. :7.000 E : 9795 Max. :11.00
## NA's :29084 (Other):12307 NA's :29084
## ListingCategory..numeric. BorrowerState
## Min. : 0.000 CA :14717
## 1st Qu.: 1.000 TX : 6842
## Median : 1.000 NY : 6729
## Mean : 2.774 FL : 6720
## 3rd Qu.: 3.000 IL : 5921
## Max. :20.000 : 5515
## (Other):67493
## Occupation EmploymentStatus
## Other :28617 Employed :67322
## Professional :13628 Full-time :26355
## Computer Programmer : 4478 Self-employed: 6134
## Executive : 4311 Not available: 5347
## Teacher : 3759 Other : 3806
## Administrative Assistant: 3688 : 2255
## (Other) :55456 (Other) : 2718
## EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup
## Min. : 0.00 False:56459 False:101218
## 1st Qu.: 26.00 True :57478 True : 12719
## Median : 67.00
## Mean : 96.07
## 3rd Qu.:137.00
## Max. :755.00
## NA's :7625
## GroupKey DateCreditPulled
## :100596 2013-12-23 09:38:12: 6
## 783C3371218786870A73D20: 1140 2013-11-21 09:09:41: 4
## 3D4D3366260257624AB272D: 916 2013-12-06 05:43:16: 4
## 6A3B336601725506917317E: 698 2014-01-14 20:17:49: 4
## FEF83377364176536637E50: 611 2014-02-09 12:14:41: 4
## C9643379247860156A00EC0: 342 2013-09-27 22:04:54: 3
## (Other) : 9634 (Other) :113912
## CreditScoreRangeLower CreditScoreRangeUpper
## Min. : 0.0 Min. : 19.0
## 1st Qu.:660.0 1st Qu.:679.0
## Median :680.0 Median :699.0
## Mean :685.6 Mean :704.6
## 3rd Qu.:720.0 3rd Qu.:739.0
## Max. :880.0 Max. :899.0
## NA's :591 NA's :591
## FirstRecordedCreditLine CurrentCreditLines OpenCreditLines
## : 697 Min. : 0.00 Min. : 0.00
## 1993-12-01 00:00:00: 185 1st Qu.: 7.00 1st Qu.: 6.00
## 1994-11-01 00:00:00: 178 Median :10.00 Median : 9.00
## 1995-11-01 00:00:00: 168 Mean :10.32 Mean : 9.26
## 1990-04-01 00:00:00: 161 3rd Qu.:13.00 3rd Qu.:12.00
## 1995-03-01 00:00:00: 159 Max. :59.00 Max. :54.00
## (Other) :112389 NA's :7604 NA's :7604
## TotalCreditLinespast7years OpenRevolvingAccounts
## Min. : 2.00 Min. : 0.00
## 1st Qu.: 17.00 1st Qu.: 4.00
## Median : 25.00 Median : 6.00
## Mean : 26.75 Mean : 6.97
## 3rd Qu.: 35.00 3rd Qu.: 9.00
## Max. :136.00 Max. :51.00
## NA's :697
## OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries
## Min. : 0.0 Min. : 0.000 Min. : 0.000
## 1st Qu.: 114.0 1st Qu.: 0.000 1st Qu.: 2.000
## Median : 271.0 Median : 1.000 Median : 4.000
## Mean : 398.3 Mean : 1.435 Mean : 5.584
## 3rd Qu.: 525.0 3rd Qu.: 2.000 3rd Qu.: 7.000
## Max. :14985.0 Max. :105.000 Max. :379.000
## NA's :697 NA's :1159
## CurrentDelinquencies AmountDelinquent DelinquenciesLast7Years
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 0.0 1st Qu.: 0.000
## Median : 0.0000 Median : 0.0 Median : 0.000
## Mean : 0.5921 Mean : 984.5 Mean : 4.155
## 3rd Qu.: 0.0000 3rd Qu.: 0.0 3rd Qu.: 3.000
## Max. :83.0000 Max. :463881.0 Max. :99.000
## NA's :697 NA's :7622 NA's :990
## PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance
## Min. : 0.0000 Min. : 0.000 Min. : 0
## 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 3121
## Median : 0.0000 Median : 0.000 Median : 8549
## Mean : 0.3126 Mean : 0.015 Mean : 17599
## 3rd Qu.: 0.0000 3rd Qu.: 0.000 3rd Qu.: 19521
## Max. :38.0000 Max. :20.000 Max. :1435667
## NA's :697 NA's :7604 NA's :7604
## BankcardUtilization AvailableBankcardCredit TotalTrades
## Min. :0.000 Min. : 0 Min. : 0.00
## 1st Qu.:0.310 1st Qu.: 880 1st Qu.: 15.00
## Median :0.600 Median : 4100 Median : 22.00
## Mean :0.561 Mean : 11210 Mean : 23.23
## 3rd Qu.:0.840 3rd Qu.: 13180 3rd Qu.: 30.00
## Max. :5.950 Max. :646285 Max. :126.00
## NA's :7604 NA's :7544 NA's :7544
## TradesNeverDelinquent..percentage. TradesOpenedLast6Months
## Min. :0.000 Min. : 0.000
## 1st Qu.:0.820 1st Qu.: 0.000
## Median :0.940 Median : 0.000
## Mean :0.886 Mean : 0.802
## 3rd Qu.:1.000 3rd Qu.: 1.000
## Max. :1.000 Max. :20.000
## NA's :7544 NA's :7544
## DebtToIncomeRatio IncomeRange IncomeVerifiable
## Min. : 0.000 $25,000-49,999:32192 False: 8669
## 1st Qu.: 0.140 $50,000-74,999:31050 True :105268
## Median : 0.220 $100,000+ :17337
## Mean : 0.276 $75,000-99,999:16916
## 3rd Qu.: 0.320 Not displayed : 7741
## Max. :10.010 $1-24,999 : 7274
## NA's :8554 (Other) : 1427
## StatedMonthlyIncome LoanKey TotalProsperLoans
## Min. : 0 CB1B37030986463208432A1: 6 Min. :0.00
## 1st Qu.: 3200 2DEE3698211017519D7333F: 4 1st Qu.:1.00
## Median : 4667 9F4B37043517554537C364C: 4 Median :1.00
## Mean : 5608 D895370150591392337ED6D: 4 Mean :1.42
## 3rd Qu.: 6825 E6FB37073953690388BC56D: 4 3rd Qu.:2.00
## Max. :1750003 0D8F37036734373301ED419: 3 Max. :8.00
## (Other) :113912 NA's :91852
## TotalProsperPaymentsBilled OnTimeProsperPayments
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 9.00 1st Qu.: 9.00
## Median : 16.00 Median : 15.00
## Mean : 22.93 Mean : 22.27
## 3rd Qu.: 33.00 3rd Qu.: 32.00
## Max. :141.00 Max. :141.00
## NA's :91852 NA's :91852
## ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 0.00 Median : 0.00
## Mean : 0.61 Mean : 0.05
## 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :42.00 Max. :21.00
## NA's :91852 NA's :91852
## ProsperPrincipalBorrowed ProsperPrincipalOutstanding
## Min. : 0 Min. : 0
## 1st Qu.: 3500 1st Qu.: 0
## Median : 6000 Median : 1627
## Mean : 8472 Mean : 2930
## 3rd Qu.:11000 3rd Qu.: 4127
## Max. :72499 Max. :23451
## NA's :91852 NA's :91852
## ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent
## Min. :-209.00 Min. : 0.0
## 1st Qu.: -35.00 1st Qu.: 0.0
## Median : -3.00 Median : 0.0
## Mean : -3.22 Mean : 152.8
## 3rd Qu.: 25.00 3rd Qu.: 0.0
## Max. : 286.00 Max. :2704.0
## NA's :95009
## LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber
## Min. : 0.00 Min. : 0.0 Min. : 1
## 1st Qu.: 9.00 1st Qu.: 6.0 1st Qu.: 37332
## Median :14.00 Median : 21.0 Median : 68599
## Mean :16.27 Mean : 31.9 Mean : 69444
## 3rd Qu.:22.00 3rd Qu.: 65.0 3rd Qu.:101901
## Max. :44.00 Max. :100.0 Max. :136486
## NA's :96985
## LoanOriginalAmount LoanOriginationDate LoanOriginationQuarter
## Min. : 1000 2014-01-22 00:00:00: 491 Q4 2013:14450
## 1st Qu.: 4000 2013-11-13 00:00:00: 490 Q1 2014:12172
## Median : 6500 2014-02-19 00:00:00: 439 Q3 2013: 9180
## Mean : 8337 2013-10-16 00:00:00: 434 Q2 2013: 7099
## 3rd Qu.:12000 2014-01-28 00:00:00: 339 Q3 2012: 5632
## Max. :35000 2013-09-24 00:00:00: 316 Q2 2012: 5061
## (Other) :111428 (Other):60343
## MemberKey MonthlyLoanPayment LP_CustomerPayments
## 63CA34120866140639431C9: 9 Min. : 0.0 Min. : -2.35
## 16083364744933457E57FB9: 8 1st Qu.: 131.6 1st Qu.: 1005.76
## 3A2F3380477699707C81385: 8 Median : 217.7 Median : 2583.83
## 4D9C3403302047712AD0CDD: 8 Mean : 272.5 Mean : 4183.08
## 739C338135235294782AE75: 8 3rd Qu.: 371.6 3rd Qu.: 5548.40
## 7E1733653050264822FAA3D: 8 Max. :2251.5 Max. :40702.39
## (Other) :113888
## LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees
## Min. : 0.0 Min. : -2.35 Min. :-664.87
## 1st Qu.: 500.9 1st Qu.: 274.87 1st Qu.: -73.18
## Median : 1587.5 Median : 700.84 Median : -34.44
## Mean : 3105.5 Mean : 1077.54 Mean : -54.73
## 3rd Qu.: 4000.0 3rd Qu.: 1458.54 3rd Qu.: -13.92
## Max. :35000.0 Max. :15617.03 Max. : 32.06
##
## LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss
## Min. :-9274.75 Min. : -94.2 Min. : -954.5
## 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 0.00 Median : 0.0 Median : 0.0
## Mean : -14.24 Mean : 700.4 Mean : 681.4
## 3rd Qu.: 0.00 3rd Qu.: 0.0 3rd Qu.: 0.0
## Max. : 0.00 Max. :25000.0 Max. :25000.0
##
## LP_NonPrincipalRecoverypayments PercentFunded Recommendations
## Min. : 0.00 Min. :0.7000 Min. : 0.00000
## 1st Qu.: 0.00 1st Qu.:1.0000 1st Qu.: 0.00000
## Median : 0.00 Median :1.0000 Median : 0.00000
## Mean : 25.14 Mean :0.9986 Mean : 0.04803
## 3rd Qu.: 0.00 3rd Qu.:1.0000 3rd Qu.: 0.00000
## Max. :21117.90 Max. :1.0125 Max. :39.00000
##
## InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
## Min. : 0.00000 Min. : 0.00 Min. : 1.00
## 1st Qu.: 0.00000 1st Qu.: 0.00 1st Qu.: 2.00
## Median : 0.00000 Median : 0.00 Median : 44.00
## Mean : 0.02346 Mean : 16.55 Mean : 80.48
## 3rd Qu.: 0.00000 3rd Qu.: 0.00 3rd Qu.: 115.00
## Max. :33.00000 Max. :25000.00 Max. :1189.00
##
# Rename the homeowner levels from "True"/"False" to "Homeowner"/"NotHomeowner"
levels(ld$IsBorrowerHomeowner)[levels(ld$IsBorrowerHomeowner) == 'True'] <- 'Homeowner'
levels(ld$IsBorrowerHomeowner)[levels(ld$IsBorrowerHomeowner) == 'False'] <- 'NotHomeowner'
# Display homeowner counts (a bar chart, since the variable is categorical)
ggplot(ld, aes(IsBorrowerHomeowner)) +
  geom_bar(fill = 'orange')
##
## NotHomeowner Homeowner
## 56459 57478
Homeowner and NotHomeowner counts are almost the same: the homeowner variable splits the dataset roughly in half. We will use that later in our analysis.
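The two level assignments above could also be written as a single `factor()` call. This is a minimal sketch on a toy vector; `homeowner` is a made-up stand-in for `ld$IsBorrowerHomeowner`, not the Prosper data:

```r
# Toy stand-in for ld$IsBorrowerHomeowner (hypothetical values)
homeowner <- factor(c("True", "False", "True", "True", "False"))

# Relabel both levels in one step; labels are matched to levels by position
homeowner <- factor(homeowner,
                    levels = c("False", "True"),
                    labels = c("NotHomeowner", "Homeowner"))

# Counts and shares per level
counts <- table(homeowner)
shares <- prop.table(counts)
print(counts)
print(round(shares, 2))
```

Listing the levels explicitly also fixes their order, which is why "NotHomeowner" comes first in the table.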
# Display the lengths of loan terms in months
ggplot(aes(x = Term), data = ld)+
geom_histogram(binwidth = 1, fill = I('#005b96')) +
scale_x_continuous(breaks = seq(0,60,12))
# Display the lengths of loan terms in years
ggplot(aes(x = Term/12), data = ld)+
geom_histogram(binwidth = 1, fill = I('#005b96')) +
scale_x_continuous(breaks = seq(1,5,2)) +
xlab("Term in Years")
##
## 12 36 60
## 1614 87778 24545
All the borrowing terms are either 12, 36 or 60 months. Most of the terms are 3 years (about 88,000), many are 5 years (about 25,000) and a few are 1 year (about 1,600).
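To put these counts in proportion, the shares can be computed directly from the table output. A small sketch, with the counts copied by hand from the output above rather than read from `ld`:

```r
# Counts per term in months, copied from the table output above
term_counts <- c("12" = 1614, "36" = 87778, "60" = 24545)

total_loans <- sum(term_counts)           # should equal the number of rows
term_share  <- term_counts / total_loans  # proportion of loans per term

print(total_loans)
print(round(term_share, 3))
```

The total matches the 113,937 rows reported at the top, so no loan has a missing term.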
# Display interest rate
ggplot(aes(BorrowerRate), data =ld) +
geom_histogram(binwidth = .01, fill = 'black', color = 'darkred') +
scale_x_continuous(breaks = seq(0,.5,.05)) +
xlab('Interest Rate')
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1340 0.1840 0.1928 0.2500 0.4975
The interest rates are between 0 and 0.5. Most of the loans have interest rates between 0.05 and 0.35.
# Order ProsperRating(alpha) levels from the best rating to the lowest.
ld$ProsperRating..Alpha. <- ordered(ld$ProsperRating..Alpha., levels = c("", "AA", "A", "B", "C", "D", "E", "HR"))
# Display Prosper ratings
ggplot(aes(ProsperRating..Alpha.), data = ld) +
geom_bar(fill = I('#EB7260'), color = I( '#DD5F32')) +
xlab('Prosper Rating (alphabetical)')
We have no Prosper rating for about 29,000 loans. This unknown category is even larger than the largest known rating category. Let’s remove the unknown bin to get a better look at the remaining ones.
# Display prosper ratings without unknown data.
ggplot(aes(ProsperRating..Alpha.), data = ld) +
geom_bar(fill = I('#EB7260'), color = I( '#DD5F32'))+
scale_x_discrete(limits = c("AA", "A", "B", "C", "D", "E", "HR")) +
xlab('Prosper Rating (alphabetical)')
As we can see, the count rises as we move from either end towards the middle rating (C). The x-axis is ordered from the highest rating (AA) to the lowest (HR).
# Best (AA) rated loans
ggplot(aes(BorrowerRate), data = subset(ld, ProsperRating..Alpha. == 'AA')) +
geom_histogram(fill = I('#1d8659')) +
scale_x_continuous(breaks = seq(0.05,0.2,0.015))+
ggtitle('Interest Rates of Highest Prosper Rated Loans')
## [1] 5372 81
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04000 0.06990 0.07790 0.07912 0.08450 0.21000
The interest rate distribution of the highest Prosper rating class is skewed to the right. Because the median (0.0779) is below the mean (0.0791), a borrower with an “AA” rating is more likely than not to pay a lower rate than the mean of the group. Most borrowers are between 0.055 and 0.09.
There are about 5,400 loans (5,372) in the “AA” class.
# Worst (HR) rated loans
ggplot(aes(BorrowerRate), data = subset(ld, ProsperRating..Alpha. == 'HR')) +
geom_histogram(fill = I('#1d8659'), binwidth = .01) +
scale_x_continuous(breaks = seq(0.20,0.35,0.01)) +
ggtitle('Interest Rates of Lowest Prosper Rated Loans')
## [1] 6935 81
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1779 0.3134 0.3177 0.3173 0.3177 0.3600
This is a left-skewed distribution with most borrowers between 0.3 and 0.325. It is a much narrower distribution, which suggests that investors don’t really differentiate between the people in this group: a member of this worst-rated group gets almost the same interest rate as the others in it. Borrowers in this group have to pay significantly higher interest rates. The difference between the medians of the two groups is huge: the median interest rate of the worst-rated class (0.3177) is about four times that of the best-rated class (0.0779).
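The comparison of the two classes can be checked with a few lines of arithmetic. This sketch uses the summary statistics copied from the two outputs above; the helper `skew_hint` is a hypothetical name, and the mean-versus-median comparison is only a rule of thumb for the direction of skew:

```r
# Summary statistics copied from the two summary() outputs above
aa <- c(median = 0.07790, mean = 0.07912)
hr <- c(median = 0.3177,  mean = 0.3173)

# Ratio of the medians: how many times higher the typical HR rate is
median_ratio <- hr[["median"]] / aa[["median"]]
print(round(median_ratio, 2))

# Rule of thumb: mean above median suggests right skew, mean below median left skew
skew_hint <- function(s) if (s[["mean"]] > s[["median"]]) "right" else "left"
print(c(AA = skew_hint(aa), HR = skew_hint(hr)))
```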
How are the values spread within this “HR” distribution? My assumption is that only a few rates will have large counts. Let’s see.
##
## 0.1779 0.1789 0.1794 0.1799 0.18 0.1805 0.1806 0.181 0.1818 0.1823
## 2 1 1 1 4 1 1 1 4 2
## 0.1827 0.1829 0.1845 0.1875 0.1895 0.19 0.1925 0.195 0.1988 0.199
## 5 13 2 1 3 5 1 1 1 1
## 0.2 0.2003 0.2085 0.2095 0.21 0.22 0.2234 0.2285 0.2295 0.23
## 9 1 1 3 9 4 1 1 3 5
## 0.2322 0.238 0.24 0.2417 0.245 0.2487 0.2495 0.25 0.255 0.2574
## 1 1 4 1 2 1 1 12 1 1
## 0.2588 0.2595 0.26 0.265 0.2682 0.2695 0.27 0.2724 0.275 0.2785
## 1 4 12 1 1 1 10 1 2 2
## 0.279 0.2799 0.28 0.2841 0.2845 0.285 0.2872 0.2874 0.2888 0.289
## 1 2 9 1 2 4 1 7 1 1
## 0.29 0.2903 0.2924 0.2943 0.295 0.2955 0.297 0.2975 0.298 0.299
## 12 1 62 1 1 1 1 2 1 1
## 0.2994 0.2995 0.2998 0.2999 0.3 0.3009 0.301 0.3025 0.303 0.304
## 2 2 1 131 16 4 1 1 1 1
## 0.3049 0.305 0.3059 0.3075 0.3079 0.3089 0.3093 0.3094 0.3095 0.3097
## 1 1 113 26 3 1 1 1 4 1
## 0.3099 0.31 0.311 0.3121 0.3125 0.3127 0.3134 0.3174 0.3175 0.3177
## 4 55 1 1 268 218 722 1 1 3672
## 0.3179 0.3195 0.3199 0.32 0.321 0.3239 0.324 0.3248 0.3249 0.325
## 1 1 604 17 1 1 1 2 2 2
## 0.3267 0.3269 0.327 0.3275 0.328 0.3283 0.3285 0.3289 0.329 0.3295
## 1 1 2 3 2 1 1 1 3 1
## 0.3297 0.3299 0.33 0.3323 0.333 0.3334 0.3335 0.3345 0.335 0.3375
## 1 1 21 1 1 1 1 1 5 1
## 0.338 0.3384 0.3385 0.3387 0.339 0.3391 0.3395 0.3399 0.34 0.3411
## 1 1 1 1 2 1 2 3 30 1
## 0.3418 0.3423 0.3424 0.3425 0.3428 0.3433 0.344 0.3445 0.345 0.3459
## 1 1 2 1 1 1 3 2 8 1
## 0.346 0.3475 0.348 0.3484 0.3485 0.349 0.3494 0.3495 0.3498 0.3499
## 1 2 2 1 6 1 2 12 3 7
## 0.35 0.36
## 633 4
As I thought, a few interest rates cover almost everyone in this group. For example, slightly more than half of the group (3,672 of 6,935 loans) pays an interest rate of 0.3177. Let’s do some cleanup and remove all the rates with fewer than 15 counts from the table:
subset(ld.prosperRating_by_worst, ld.prosperRating_by_worst >= 15)
##
## 0.2924 0.2999 0.3 0.3059 0.3075 0.31 0.3125 0.3127 0.3134 0.3177
## 62 131 16 113 26 55 268 218 722 3672
## 0.3199 0.32 0.33 0.34 0.35
## 604 17 21 30 633
Only 15 distinct interest rates remain! Together they account for around 6,600 of the 6,935 loans, and almost half of them (7 of the 15) have fewer than 75 counts each.
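How concentrated the “HR” class is can be quantified from the filtered table. A sketch, with the counts copied by hand from the output above (the variable names are illustrative, not the ones used in the original analysis):

```r
# Counts of the 15 most common HR interest rates, copied from the output above
hr_common <- c("0.2924" = 62,  "0.2999" = 131,  "0.3"    = 16,  "0.3059" = 113,
               "0.3075" = 26,  "0.31"   = 55,   "0.3125" = 268, "0.3127" = 218,
               "0.3134" = 722, "0.3177" = 3672, "0.3199" = 604, "0.32"   = 17,
               "0.33"   = 21,  "0.34"   = 30,   "0.35"   = 633)
hr_total <- 6935  # number of HR loans, from the dim() output above

covered    <- sum(hr_common)             # loans at one of the 15 common rates
share      <- covered / hr_total         # fraction of the HR class they cover
top_share  <- max(hr_common) / hr_total  # fraction paying the single most common rate
n_below_75 <- sum(hr_common < 75)        # common rates with fewer than 75 loans

print(covered)
print(round(c(share = share, top_share = top_share), 3))
print(n_below_75)
```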
For comparison, I applied the same filter to the best-rated class and removed all rates with fewer than 15 counts:
#make a table with the best rated ProsperRating unique value counts
ld.prosperRating_AA_table <- table(ld.by_alpha_AA$BorrowerRate)
##
## 0.0499 0.0565 0.0599 0.0604 0.0605 0.061 0.0625 0.0629 0.0649 0.0655
## 45 81 46 27 235 26 28 85 160 101
## 0.0659 0.0666 0.0699 0.07 0.071 0.0715 0.0716 0.072 0.0724 0.0749
## 253 45 171 29 140 19 200 33 38 141
## 0.0759 0.0765 0.0766 0.0769 0.0779 0.0785 0.0789 0.0799 0.08 0.0804
## 125 23 125 350 39 26 36 219 57 23
## 0.0809 0.0814 0.0819 0.0825 0.083 0.0839 0.0845 0.0849 0.0854 0.0864
## 413 31 130 57 57 216 24 182 37 52
## 0.0869 0.0899 0.093 0.0945 0.096 0.0961 0.1 0.103 0.1042 0.105
## 222 76 64 21 19 57 15 16 20 35
## 0.1071 0.1076 0.1085 0.11 0.1101 0.115 0.1154 0.1199 0.1208
## 15 62 39 23 32 15 22 34 82
As we can see, the rates are spread much more widely: the best-rated group has many more distinct interest rates. It seems that more variables influence the interest rate in this group, so the Prosper rating alone is not enough to predict the rate within a narrow range. In the lowest-rated group, by contrast, it is rare to get a rate outside the very narrow range where about 90% of the borrowers are found.
#Display Loan Categories
ggplot(aes(ListingCategory..numeric.), data = ld) +
geom_bar(fill = I('#160A47'), color = 'darkblue') +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) +
xlab('Category')
Because this is an unordered categorical variable, the shape of the distribution carries no meaning: we could rearrange the bars in any order and change the shape at will. The bar chart shows a lot of unknown data, around 17,000 loans. I’ll remove it from the chart, because it gives no useful information right now. There is also one very tall bar, “Debt Consolidation”, which dwarfs the rest. I have to transform the chart to get a better look at the smaller categories. A logarithmic scale would be a good choice because of the wide range of counts, but first let’s look at the counts.
##
## Auto Baby&Adoption Boat
## 2572 199 85
## Business Cosmetic Procedure Debt Consolidation
## 7189 91 58308
## Engagement Ring Green Loans Home Improvement
## 217 59 7433
## Household Expenses Large Purchases Medical/Dental
## 1996 876 1522
## Motorcycle Not Available Other
## 304 16965 10494
## Personal Loan RV Student Use
## 2395 52 756
## Taxes Vacation Wedding Loans
## 885 768 771
These category counts span a wide range. There are rarely chosen categories like “Green Loans” and frequently chosen ones like “Home Improvement”. Let’s apply a logarithmic transformation, because a square root transformation would not give a good view over such a wide range.
We can apply the logarithmic transformation because every count is greater than one. Zero has no logarithm, and the logarithm of one is zero, so a category with a count of 1 would be drawn with zero height. This is why we always have to check this condition first.
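The condition can be verified before plotting. A minimal sketch in base R, with the counts of the 20 known categories copied from the table above (“Not Available” left out); the vector name `category_counts` is chosen for illustration:

```r
# Known category counts, copied from the table above ("Not Available" left out)
category_counts <- c(
  "Auto" = 2572, "Baby&Adoption" = 199, "Boat" = 85, "Business" = 7189,
  "Cosmetic Procedure" = 91, "Debt Consolidation" = 58308,
  "Engagement Ring" = 217, "Green Loans" = 59, "Home Improvement" = 7433,
  "Household Expenses" = 1996, "Large Purchases" = 876,
  "Medical/Dental" = 1522, "Motorcycle" = 304, "Other" = 10494,
  "Personal Loan" = 2395, "RV" = 52, "Student Use" = 756, "Taxes" = 885,
  "Vacation" = 768, "Wedding Loans" = 771
)

# Edge cases of the logarithm
print(log10(1))   # a count of 1 maps to 0, so its bar has no height
print(log10(0))   # a count of 0 maps to -Inf and cannot be drawn

# A log scale is safe here only if every count is above 1
stopifnot(all(category_counts > 1))
print(round(range(log10(category_counts)), 2))
```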
So let’s apply the transformation and reorder the x-axis to get a clearer shape of the data, and make the visualization a little bigger for a better view.
# Display loan categories without unknown data, with reordered bins
ggplot(aes(x = reorder(Category , Freq), y = Freq),
data = subset(listing_category_df, Category != 'Not Available')) +
geom_bar(stat = 'identity' ,fill = I('#160A47'), color = I('darkblue')) +
scale_y_log10(breaks = c(10,50,100,250,500,1000,2000,5000,10000,60000)) +
coord_cartesian(ylim = c(1,60000)) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
xlab('Category')
Perfect! Now we can see all the data together. The logarithmic transformation did a good job here, so I won’t apply another rescaling. Looking at this chart, I can see that the categories fall into 3 groups, and splitting them will display the count differences within each group better. I’m going to name these groups for easier reference later.
# Display low counted categories (RV-Motorcycle)
ggplot(low_counted_group.df, aes(Category, Freq)) +
geom_bar(stat = 'Identity' ,fill = I('#160A47'), color = I('darkblue')) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) +
ggtitle('Low Counted Categories')
## Category Freq
## 2 Baby&Adoption 199
## 3 Boat 85
## 5 Cosmetic Procedure 91
## 7 Engagement Ring 217
## 8 Green Loans 59
## 13 Motorcycle 304
## 17 RV 52
These categories are very unpopular choices for borrowers.
# Display medium counted categories (Student Use - Auto)
ggplot(medium_counted_group.df, aes(Category, Freq)) +
geom_bar(stat = 'Identity',fill = I('#160A47'), color = I('darkblue'))+
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) +
ggtitle('Medium Counted Categories')
## Category Freq
## 1 Auto 2572
## 10 Household Expenses 1996
## 11 Large Purchases 876
## 12 Medical/Dental 1522
## 16 Personal Loan 2395
## 18 Student Use 756
## 19 Taxes 885
## 20 Vacation 768
## 21 Wedding Loans 771
These are the categories that borrowers will most likely choose if they are not picking from the High Counted Categories.
# Display high counted categories (Business - Debt Consolidation)
ggplot(high_counted_group.df, aes(Category, Freq)) +
geom_bar(stat = 'Identity',fill = I('#160A47'), color = I('darkblue')) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
ggtitle('High Counted Categories')
## Category Freq
## 4 Business 7189
## 6 Debt Consolidation 58308
## 9 Home Improvement 7433
## 15 Other 10494
The borrowers are most likely to pick a category from these 4.
What proportions of the loans fall into the low, medium and high counted groups? I will leave the unknown data out of the calculation. This is important, because the proportions of the three groups sum to 1 only if the unknown loans are excluded from the total.
Let’s see the proportions.
# Category proportion
ggplot(category_sums, aes(x =reorder(Category, - Freq), y =Freq/all_count)) +
geom_bar(stat = 'Identity', fill = I('#160A47')) +
xlab('Category')
## [1] 0.01038444 0.12932599 0.86028957
Just as I thought. “High” is far from the rest of the data, roughly seven times the size of “Medium”. “Low” almost disappears next to the other groups, but it still holds around a 1% slice of the loan data set. So about every 100th borrower chooses one of the categories in the “Low” group, even though it contains 7 of the 20 categories. Small, right? A third of the categories add up to 1% of the loans. Almost nothing.
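The three proportions can be reproduced from the group tables. This is a sketch with the `Freq` columns copied by hand; the names `group_sums` and `proportions` are illustrative, while `all_count` mirrors the name used in the plotting code:

```r
# Freq columns copied from the three group tables above
low    <- c(199, 85, 91, 217, 59, 304, 52)
medium <- c(2572, 1996, 876, 1522, 2395, 756, 885, 768, 771)
high   <- c(7189, 58308, 7433, 10494)

group_sums  <- c(Low = sum(low), Medium = sum(medium), High = sum(high))
all_count   <- sum(group_sums)       # "Not Available" is excluded
proportions <- group_sums / all_count

print(proportions)
print(sum(proportions))              # proportions of the known categories add up to 1
print(proportions[["High"]] / proportions[["Medium"]])
```

If the 16,965 “Not Available” loans were included in `all_count`, the three shares would no longer sum to 1.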
# Display monthly income
ggplot(aes(StatedMonthlyIncome), data = ld) +
geom_histogram(fill = I('#000000')) +
xlab('Monthly Income')
Well, there is one peak on the left and nothing else visible. But why does the x axis reach 1,500,000? Does someone really have that much monthly income? What is the range of these income numbers?
## [1] 0 1750003
The maximum monthly income is 1,750,003. This completely ruins our chart. Is this real? Could it be a data entry mistake? Maybe a billionaire has this much income. Either way, I am not going to deal with this value. I am going to cap the x axis at the 99% quantile. This should give a much narrower scale, so the other counts become visible.
# Display monthly income with a 99% quantile
ggplot(aes(StatedMonthlyIncome), data = ld) +
geom_histogram(binwidth = 500, fill = I('#000000')) +
scale_x_continuous(limits = c(0, quantile(ld$StatedMonthlyIncome, .99))) +
xlab('Monthly Income')
Here it is! Looks better. This is a positively skewed distribution, just as one would expect.
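To illustrate what the 99% quantile cut-off does, here is a small sketch on made-up incomes. The vector below is invented for the example and is not taken from the loan data. One extreme value stretches the range, and the quantile gives a cut-off that ignores it.

```r
# Toy data (assumption): 200 ordinary incomes plus one extreme value
income <- c(rep(c(2500, 4000, 5000, 6250), each = 50), 1750003)

# The full range is dominated by the single outlier
print(range(income))

# The 99% quantile is a cut-off that the outlier does not move
cap <- quantile(income, 0.99)
print(cap)

# Keep only the incomes at or below the cut-off
trimmed <- income[income <= cap]
print(range(trimmed))
```

In the ggplot call above, the limits passed to scale_x_continuous play a similar role: observations outside the limits are dropped before the histogram is computed.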
I expect that some incomes have far higher counts than others, like the very round numbers: 5000, 2500, 1000, et cetera.
# Display monthly income with a 99% quantile and narrower bins
ggplot(aes(x = StatedMonthlyIncome), data = ld) +
geom_histogram(binwidth = 100, fill = I('#000000')) +
scale_x_continuous(limits = c(0, quantile(ld$StatedMonthlyIncome, .99))) +
xlab('Monthly Income')
##
## 2500 3333.333333 3750 4166.666667 4583.333333 5000
## 2256 2917 2428 3526 2211 3389
## 5416.666667 5833.333333 6250 6666.666667
## 2374 2319 2276 2162
As I expected. I listed the values where the borrower count is more than 2,000. I think values like 2500, 3750, 5000 and 6250 come from the employee and the employer agreeing on a round gross payment. They will not settle on 4,996 or 4,984. They will make it 5,000. The rest may come from the net payment. For example, they agree on a net 2000, and before taxes it may have been 3333 or 4166. It makes sense. The same rounding works on the gross and on the net income.
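One way to check where the non-round monthly values come from is to multiply them by 12. This sketch uses the ten values printed above. Reading them as yearly salaries divided by 12 is an assumption, offered alongside the gross/net explanation rather than as a replacement for it.

```r
# Monthly incomes with more than 2,000 borrowers, copied from the table above
monthly <- c(2500, 3333.333333, 3750, 4166.666667, 4583.333333,
             5000, 5416.666667, 5833.333333, 6250, 6666.666667)

# Assumption: stated monthly income = yearly salary / 12
annual <- round(monthly * 12)
print(annual)

# Check whether every yearly figure is a multiple of 5,000
print(all(annual %% 5000 == 0))
```

Under that assumption, values such as 3333.33 and 4166.67 correspond to yearly salaries of 40,000 and 50,000.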
#Display monthly loan payment
ggplot(aes(MonthlyLoanPayment), data = ld) +
geom_histogram(color = I('#ff1a8c'), fill = I('#660066')) +
xlab('Monthly Loan Payment')
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 131.6 217.7 272.5 371.6 2251.5
Positively skewed distribution. Nothing special. People rarely pay more than 1,000 in a month. Half of the payments are below the median of about 218, and the middle half lie between 131 and 372. I am going to scale the x axis from 0 to 1,000.
#Display monthly loan payment & Rescale to 1,000
ggplot(aes(x = MonthlyLoanPayment), data = ld) +
geom_histogram(color = I('#ff1a8c'), fill = I('#660066'), binwidth = 25) +
# The limits are set once here; a following xlim() would replace this scale and its breaks
scale_x_continuous(breaks = seq(0,1000,50), limits = c(0,1000)) +
xlab('Monthly Loan Payment')
The counts increase from 0 to about 175 and then decrease until 750. There is a small peak around 850, and after that almost nothing. After 400 the height suddenly falls to half of the previous level. According to the summary, the median is about 218, so half of the borrowers pay less than 220. But from the shape of the chart it is maybe better to say that monthly payments are common up to 400.
# Display borrowed loan
ggplot(aes(LoanOriginalAmount), data = ld) +
geom_histogram(fill = I('#00ba4a'), color =I('#aa0052'), binwidth = 1000) +
scale_x_continuous(breaks = seq(0,35000,5000)) +
xlab('Amount of Loan')
The shape of this distribution is positively skewed again. I explained previously why the bins at round numbers stand out. People mostly like to borrow round amounts, like 5,000, 10,000, 15,000, et cetera. To check this, let’s change the bin width to a narrower one.
# Display borrowed loan and 100 bin width
ggplot(aes(LoanOriginalAmount), data = ld) +
geom_histogram(fill = I('#00ba4a'), binwidth = 100) +
scale_x_continuous(breaks = seq(0,35000,5000)) +
xlab('Amount of Loan')
There it is! As I expected, the spikes sit on those numbers. I made a mistake, though: the first big one is not at 5,000 but at 4,000. I think the spikes repeat at regular steps of 1,000 or 500. Let’s check these statements with real numbers! I will list the values with counts greater than or equal to 2,000.
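# Count loans per original amount, then keep the amounts with at least 2,000 loans
ld.by_monthly_payment.tb <- table(ld$LoanOriginalAmount)
ld.by_monthly_payment.tb[ld.by_monthly_payment.tb >= 2000]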
##
## 1000 2000 2500 3000 3500 4000 5000 6000 7000 7500 8000 10000
## 3206 6067 2992 5749 2567 14333 6990 2869 2949 2975 2442 11106
## 15000 20000 25000
## 12407 3291 3630
So what we can tell is that people borrow very round amounts. From 1,000 to 5,000 the step size is 1,000 at most and 500 at least. The count for 4,000 is more than twice the count for 5,000. Maybe borrowers think 5,000 is too much and settle for 4,000. Even if they need more, they do not want to risk being unable to pay it back. Very few people borrow 4,500, maybe because it sits so close to both 4,000 and 5,000 that borrowers pick one of those instead. And if 5,000 feels like too much, 4,500 is not far from it, so they go back to 4,000, which feels safer.
From 5,000 to 10,000 the step size is 1,000. Borrowers are not going for the X,500 amounts. They like round numbers. 9,000 has too few loans to make the list. Whoever can borrow 9,000 can probably afford 10,000, and it is a nicer number. 7,500 is also a nice number. From 10,000 to 35,000 the step size is 5,000. Anyone who borrows in that range won’t go for, let’s say, 32,000. They will go for 30,000 or 35,000.
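The table above can be turned into a quick check of how round the popular amounts are. This sketch re-enters the printed amounts and counts by hand. The total of 113,937 loans is taken from the dimensions reported at the top of this document.

```r
# Amounts with at least 2,000 loans and their counts, copied from the table above
amount <- c(1000, 2000, 2500, 3000, 3500, 4000, 5000, 6000,
            7000, 7500, 8000, 10000, 15000, 20000, 25000)
count  <- c(3206, 6067, 2992, 5749, 2567, 14333, 6990, 2869,
            2949, 2975, 2442, 11106, 12407, 3291, 3630)

# Check whether every popular amount is a multiple of 500
print(all(amount %% 500 == 0))

# The popular amounts that are not a multiple of 1,000
print(amount[amount %% 1000 != 0])

# Share of all loans that sit on one of these 15 amounts
total_loans <- 113937
share_popular <- sum(count) / total_loans
print(round(share_popular, 3))
```

Roughly 73% of all loans fall on just these 15 amounts.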
# Display delinquent days
ggplot(aes(ld$LoanCurrentDaysDelinquent), data = ld) +
geom_histogram( fill =I('#00e6b8')) +
ggtitle("Count of Delinquent Days") +
xlab("Delinquent Days")
A column with a count of about 100,000 at the start ruins our chart. Let’s examine what it is.
# Display delinquent days and rescale
ggplot(aes(ld$LoanCurrentDaysDelinquent), data = ld) +
geom_histogram(binwidth = 1, fill =I('#00e6b8')) +
scale_x_continuous(limits = c(-1,100)) +
xlab("Delinquent Days")
It is the zero value. Of the almost 114,000 loans, about 90,000 have zero delinquent days, so their borrowers pay on time. Very good. We should not bother with it; let’s just leave it out of the chart to see the others.
# Display delinquent days without zero
ggplot(aes(ld$LoanCurrentDaysDelinquent), data = ld) +
geom_histogram(binwidth = 10, fill =I('#00e6b8')) +
scale_x_continuous(limits = c(1,2500)) +
xlab("Delinquent Days")
There are 2 peaks in the data: one at the beginning and one around 2,000. We can split the range into two parts, [1, 1000] and (1000, 2500].
# Display delinquent days from 1 to 1,000
ggplot(aes(ld$LoanCurrentDaysDelinquent), data = ld) +
geom_histogram(binwidth = 10, fill =I('#00e6b8')) +
scale_x_continuous(limits = c(1,1000)) +
xlab("Delinquent Days")
Some borrowers are late by a few days, and after that the counts settle down. At ~125 there is one bin that stands out.
# Display delinquent days from 1,001 to 2,500
ggplot(aes(ld$LoanCurrentDaysDelinquent), data = ld) +
geom_histogram(binwidth = 10, fill =I('#00e6b8')) +
scale_x_continuous(limits = c(1001,2500)) +
xlab("Delinquent Days")
Something bothers me: do these standout bins repeat at regular intervals? It looks that way in both the first and the second part. I’ll examine this later, but first let’s find out what that strange bin at ~125 is.
# Display delinquent days, investigate outstanding bin at around 125
ggplot(aes(ld$LoanCurrentDaysDelinquent), data = ld) +
geom_histogram(binwidth = 1, fill =I('#00e6b8')) +
scale_x_continuous(limits = c(100,130), breaks = seq(100,130,1)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) +
xlab("Delinquent Days")
It is at 121. What is it? It is four 30-day months plus one day, about a third of a year. Maybe after four months and 1 day they have to pay late charges. So people who don’t really have money wait until the last day.
And now let’s examine the sequence.
# Display delinquent days
# Investigate the sequence of standout bins
# First year without 0
ggplot(aes(x =LoanCurrentDaysDelinquent),
data = subset(ld, LoanCurrentDaysDelinquent >0)) +
geom_histogram(binwidth =1, alpha = 1, fill =I('#00e6b8')) +
scale_x_continuous(breaks = seq(0,365,30), limits = c(0,365)) +
xlab("Delinquent Days")
Not a really good view: the value at 121 is too high. Let’s zoom in. It would also help to lower the chart’s opacity, so the dips are easier to match to the x axis, and to make the chart wider for a more precise examination.
# Display delinquent days
# Investigate the sequence of standout bins
# First year without 0
# Zoom in
ggplot(aes(x =LoanCurrentDaysDelinquent), data = ld) +
geom_histogram(binwidth =1, alpha =.5, fill =I('#00e6b8')) +
scale_x_continuous(breaks = seq(0,365,30), limits = c(0,365)) +
coord_cartesian(ylim = c(0,30)) +
ggtitle('Days Delinquent (First Year)') +
xlab("Delinquent Days")
It is hard to recognize exactly where the counts go down on the x axis. I do not see any strong regular pattern in the dips. Maybe there is one, but I cannot tell. Let’s fold the first six years onto the interval from 0 to 364 by applying a division with remainder. Maybe it will sharpen the chart “curves”.
# Display delinquent days
# Investigate the sequence of standout bins
# 6 years (division with remainder)
# Zoom in
ggplot(aes(x = LoanCurrentDaysDelinquent %% 365),
data = subset(ld, LoanCurrentDaysDelinquent > 0 &
LoanCurrentDaysDelinquent< 2190)) +
geom_histogram(binwidth = 1, alpha =.5, fill =I('#00e6b8')) +
scale_x_continuous(limits = c(0,364), breaks = seq(0,364,30)) +
coord_cartesian(ylim = c(0,75)) +
xlab("Delinquent Days")
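The “division with remainder” used here is R’s %% operator. A small sketch with made-up day counts (not taken from the data) shows how delinquencies from different years land on the same position within a 365-day year:

```r
# Made-up delinquency lengths in days (assumption, for illustration only)
days <- c(121, 486, 851, 30, 395)

# Remainder after division by 365: position within the year
day_in_year <- days %% 365
print(day_in_year)

# Whole years that have already passed for each value
years_passed <- days %/% 365
print(years_passed)

# The upper bound used in the subset above is six years in days
print(6 * 365)
```

Values such as 121, 486 and 851 all fold onto day 121, so a pattern that repeats every year adds up in the same bin.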
Yes, it made the dips clearer. What I see is that the counts drop a couple of days after each multiple of 30. I want to be more precise and show this better by shifting the x axis breaks.
# Display delinquent days
# Investigate the sequence of standout bins
# 6 years (division with remainder)
# Zoom in
# Rescale x-axis
ggplot(aes(x = LoanCurrentDaysDelinquent %% 365),
data = subset(ld, LoanCurrentDaysDelinquent > 0 &
LoanCurrentDaysDelinquent< 2190)) +
geom_histogram(binwidth = 1, alpha =.5, fill =I('#00e6b8')) +
scale_x_continuous(limits = c(0,364), breaks = seq(6,364,30)) +
coord_cartesian(ylim = c(0,75)) +
xlab("Delinquent Days")
Here it is. We can see the dips at the x-axis breaks.
# Display Debt to Income Ratio
ggplot(aes(ld$DebtToIncomeRatio), data = ld)+
geom_histogram(fill = I('#862d2d')) +
xlab('Debt to Income Ratio')
Wow! Someone has 10 times more debt than income. Thanks to him, we have to rescale the x axis again. Let’s do it.
# Display Debt to Income Ratio
# Limit x-axis
ggplot(aes(ld$DebtToIncomeRatio), data = ld)+
geom_histogram(fill = I('#862d2d'), binwidth = .01) +
scale_x_continuous(limits = c(0,1), breaks = seq(0,1,.25)) +
xlab('Debt to Income Ratio')
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8554
What do we have here? A positively skewed distribution. Most of the data is between 0 and 0.5. The median is 0.22, so half of the borrowers spend 22% or less of their monthly income on debt. That is a very acceptable rate.
There are 113,937 observations with 81 variables. I am not investigating all of the variables, just the important ones, like Prosper rating, interest rate, “is borrower a homeowner”, et cetera. There are categorical and numerical variables, and also dates, but I will not examine those.
The main features in this data set are the interest rate, the Prosper rating and the borrowed amount. I’m sure that the Prosper rating has an influence on the interest rate. Maybe some other variables, like the amount of the loan, have one too.
My assumption is that “term” and “is borrower a homeowner” will help me later.
I changed the “IsBorrowerHomeowner” factor levels from True and False to “Homeowner” and “NotHomeowner”, because they are easier to plot.
No, I did not.
# Interest rate & Prosper ratings
ggplot(aes(BorrowerRate), data = ld) +
geom_histogram(fill = 'black') +
facet_wrap(~ProsperRating..Alpha., ncol = 1) +
scale_x_continuous(limits = c(0,.4)) +
xlab('Interest Rate')
Watching this plot, we can see how the distributions move to the right as we go down to lower rated categories. Let’s compare them in one plot without the unknown data.
# Interest rate & Prosper ratings
ggplot(aes(BorrowerRate),
data = subset(ld, ld$ProsperRating..Alpha. != "Unknown")) +
geom_freqpoly(aes(color = ProsperRating..Alpha.), size = 1, binwidth = .01) +
scale_x_continuous(limits = c(0,.4)) +
xlab('Interest Rate')+
labs(color ="Prosper Rating")
I also removed the Unknown data so that it won’t cross every single distribution and spoil the view.
## ld$ProsperRating..Alpha.: Unknown
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1269 0.1700 0.1833 0.2364 0.4975
## --------------------------------------------------------
## ld$ProsperRating..Alpha.: AA
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04000 0.06990 0.07790 0.07912 0.08450 0.21000
## --------------------------------------------------------
## ld$ProsperRating..Alpha.: A
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0498 0.0990 0.1119 0.1129 0.1239 0.2150
## --------------------------------------------------------
## ld$ProsperRating..Alpha.: B
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0693 0.1414 0.1509 0.1545 0.1639 0.3500
## --------------------------------------------------------
## ld$ProsperRating..Alpha.: C
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0895 0.1765 0.1914 0.1944 0.2099 0.3500
## --------------------------------------------------------
## ld$ProsperRating..Alpha.: D
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1157 0.2287 0.2492 0.2464 0.2625 0.3500
## --------------------------------------------------------
## ld$ProsperRating..Alpha.: E
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1479 0.2712 0.2925 0.2933 0.3149 0.3600
## --------------------------------------------------------
## ld$ProsperRating..Alpha.: HR
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1779 0.3134 0.3177 0.3173 0.3177 0.3600
The better the Prosper rating, the lower the interest rate. There is an interesting thing about these ratings. From B to HR the maximum interest rate is almost equal, but the maximum for ‘AA’ and ‘A’ is way lower than for the others. The HR category has a very narrow distribution: Q1 is 0.3134 and Q3 is 0.3177, so the middle 50% of the data lies between these two numbers, an interquartile range of only 0.0043. Let’s show these statistics in a plot.
# Interest rate & Prosper rating
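To compare how narrow each rating is, the interquartile range can be computed from the quartiles printed above. A minimal sketch, with the Q1, median and Q3 values re-entered by hand from that output:

```r
# Quartiles of BorrowerRate per Prosper rating, copied from the summaries above
rates <- data.frame(
  rating = c("AA", "A", "B", "C", "D", "E", "HR"),
  q1     = c(0.0699, 0.0990, 0.1414, 0.1765, 0.2287, 0.2712, 0.3134),
  median = c(0.0779, 0.1119, 0.1509, 0.1914, 0.2492, 0.2925, 0.3177),
  q3     = c(0.0845, 0.1239, 0.1639, 0.2099, 0.2625, 0.3149, 0.3177)
)

# Interquartile range: width of the middle 50% of each rating
rates$iqr <- rates$q3 - rates$q1
print(rates[, c("rating", "iqr")])

# Rating with the narrowest middle 50%
narrowest <- as.character(rates$rating[which.min(rates$iqr)])
print(narrowest)

# Check whether the median rises with every step from AA down to HR
print(all(diff(rates$median) > 0))
```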
ggplot(subset(ld, ld$ProsperRating..Alpha. != "Unknown"),
aes(ProsperRating..Alpha., BorrowerRate)) +
geom_boxplot() +
ylab('Interest Rate')
This plot describes the tendency very well. The median, Q1 and Q3 grow with every step down in rating. Up to D most outliers lie above the box, and from D on they lie mostly below it. The plot also shows very well how narrow the HR category is.
# Monthly Income & monthly Loan Payment
ggplot(ld, aes(ld$StatedMonthlyIncome, ld$MonthlyLoanPayment)) +
geom_point() +
ylab("Monthly Loan Payment") +
xlab("Monthly Income")
There are a few values really far from the others. Let’s add a limit to the x axis to see the group on the left.
# Monthly Income & monthly Loan Payment
# Limit x axis
ggplot(ld, aes(ld$StatedMonthlyIncome, ld$MonthlyLoanPayment)) +
geom_point() +
scale_x_continuous(limits = c(0,15000)) +
ylim(c(0,1000)) +
ylab("Monthly Loan Payment") +
xlab("Monthly Income")
Interesting. Why are there 2 groups on the plot (one big in the middle and one going upwards)? Maybe later I will get an answer for this by adding some other variables to the plot. But first, to get a better idea about the values, let’s change the alpha level.
# Monthly Income & monthly Loan Payment
# Limit x axis
# Lower alpha level
ggplot(ld, aes(ld$StatedMonthlyIncome, ld$MonthlyLoanPayment)) +
geom_point(alpha = 1/30) +
scale_x_continuous(limits = c(0,15000)) +
ylim(c(0,1000)) +
ylab("Monthly Loan Payment") +
xlab("Monthly Income")
There is a staircase-like growth between income and loan payment. As income grows, the loan payment sometimes grows too. There is not a strong relationship in the plot; let’s see what the correlation is.
##
## Pearson's product-moment correlation
##
## data: ld$StatedMonthlyIncome and ld$MonthlyLoanPayment
## t = 67.764, df = 113940, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1912423 0.2024055
## sample estimates:
## cor
## 0.1968303
# Borrowed Money and Monthly Loan Payment
ggplot(ld, aes(ld$LoanOriginalAmount, ld$MonthlyLoanPayment)) +
geom_point() +
ylab("Monthly Loan Payment") +
xlab("Amount of Loan")
# Borrowed Money and Monthly Loan Payment
# change alpha level
ggplot(ld, aes(ld$LoanOriginalAmount, ld$MonthlyLoanPayment)) +
geom_point(alpha =1/60) +
xlim(c(0,25000)) +
ylim(c(0,1250)) +
ylab("Monthly Loan Payment") +
xlab("Amount of Loan")
I see 3 trend lines in the plot. Two of them differ only a little from each other, and the third one is way higher. I remember there were 3 terms, telling how long it takes to pay back the loan: 1, 3 and 5 years. I’m pretty sure these 3 trend lines are connected to these terms. Later, I am going to examine this idea by adding the “Term” variable to the plot.
##
## Pearson's product-moment correlation
##
## data: ld$LoanOriginalAmount and ld$MonthlyLoanPayment
## t = 867.82, df = 113940, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9312165 0.9327426
## sample estimates:
## cor
## 0.9319837
There is a really strong relationship between the two variables. As the borrowed amount of loan grows, the monthly payment grows too.
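Three trend lines are what a fixed-payment loan with three possible terms would produce. As a sketch, assuming the standard annuity formula (whether Prosper computes payments exactly this way is an assumption, and the 20% rate is just an example value), the monthly payment for the three terms can be computed like this:

```r
# Monthly payment of a fixed-rate loan (standard annuity formula).
# Assumption: payments are monthly and the rate is a yearly nominal rate.
monthly_payment <- function(principal, yearly_rate, months) {
  r <- yearly_rate / 12
  principal * r / (1 - (1 + r)^(-months))
}

terms <- c(12, 36, 60)  # terms of 1, 3 and 5 years, in months

# Example values (assumption): a 10,000 loan at a 20% yearly rate
payments <- monthly_payment(10000, 0.20, terms)
print(round(payments, 2))

# For a fixed rate and term the payment is proportional to the amount,
# so each term gives one straight line through the origin
slopes <- monthly_payment(1, 0.20, terms)
print(slopes)
```

With these example values the 12-month payment is far above the other two, while the 36- and 60-month payments are much closer together, which matches the picture of one high line and two nearby lines.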
# Interest Rate and Borrowed Money
ggplot(ld, aes(ld$BorrowerRate, ld$LoanOriginalAmount)) +
geom_point() +
ylab("Amount of Loan") +
xlab("Interest Rate")
# Interest Rate and Borrowed Money
# Change alpha level
ggplot(ld, aes(ld$BorrowerRate, ld$LoanOriginalAmount)) +
geom_point(alpha = 1/30) +
ylab("Amount of Loan") +
xlab("Interest Rate")
# Interest Rate and Borrowed Money
# Change alpha level
# Limit x
ggplot(ld, aes(ld$BorrowerRate, ld$LoanOriginalAmount)) +
geom_point(alpha = 1/25) +
xlim(c(0.05,0.35)) +
ylim(c(0,25000)) +
ylab("Amount of Loan") +
xlab("Interest Rate")
People get loans across a wide range of interest rates. This plot will be interesting when I add the Prosper rating variable. We already know that borrowers with a lower Prosper rating get loans with higher interest rates. I can barely see the main shape of the data, so let’s add the median to the plot.
# Interest Rate and Borrowed Money
# Change alpha level
# Limits x
# Statistical curves
ggplot(ld, aes(ld$BorrowerRate, ld$LoanOriginalAmount)) +
geom_point(alpha = 1/25) +
xlim(c(0.05,0.35)) +
ylim(c(0,25000)) +
# fun replaces the deprecated fun.y argument (ggplot2 >= 3.3.0)
geom_line(stat = 'summary', fun = median, color = "blue", alpha = 0.5) +
geom_smooth(color = 'red') +
ylab("Amount of Loan") +
xlab("Interest Rate")
The blue line is the median. The red line smooths out the “noise” in the data, so it does not jump like the median line.
##
## Pearson's product-moment correlation
##
## data: ld$BorrowerRate and ld$LoanOriginalAmount
## t = -117.58, df = 113940, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3341283 -0.3237719
## sample estimates:
## cor
## -0.3289599
There is a weak to moderate, but meaningful, negative correlation between the variables. As the amount of the loan grows, the interest rate tends to decrease.
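To put the three correlation coefficients reported so far on a common footing, they can be squared. In a simple linear fit the squared correlation is the share of variance in one variable that the other accounts for. A short sketch using the printed estimates; reading the printed df of 113940 as a rounded display of 113937 - 2 is an assumption.

```r
# Pearson correlations copied from the three test outputs above
r <- c(income_vs_payment = 0.1968303,
       amount_vs_payment = 0.9319837,
       rate_vs_amount    = -0.3289599)

# Squared correlation: share of variance accounted for by a linear fit
r_squared <- r^2
print(round(r_squared, 3))

# The t statistic of the correlation test follows from r and the degrees of freedom.
# Assumption: df = number of rows - 2, with 113,937 rows in the data set.
df <- 113937 - 2
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
print(round(t_stat, 2))
```

Loan amount accounts for about 87% of the variance in the monthly payment, while income accounts for only about 4% of it.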
# Prosper Rating and Borrowed Loan
# Without unknown prosper rating
ggplot(subset(ld, ld$ProsperRating..Alpha. != 'Unknown'), aes(ProsperRating..Alpha., LoanOriginalAmount)) +
geom_boxplot() +
ylab("Amount of Loan") +
xlab("Prosper Rating")
## ld_prosper_rationg_without_unknown$ProsperRating..Alpha.: AA
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 6000 10940 11584 16000 35000
## --------------------------------------------------------
## ld_prosper_rationg_without_unknown$ProsperRating..Alpha.: A
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 5850 10000 11460 15000 35000
## --------------------------------------------------------
## ld_prosper_rationg_without_unknown$ProsperRating..Alpha.: B
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 6000 10000 11622 15000 35000
## --------------------------------------------------------
## ld_prosper_rationg_without_unknown$ProsperRating..Alpha.: C
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 5000 10000 10392 15000 25000
## --------------------------------------------------------
## ld_prosper_rationg_without_unknown$ProsperRating..Alpha.: D
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6100 7083 10000 15000
## --------------------------------------------------------
## ld_prosper_rationg_without_unknown$ProsperRating..Alpha.: E
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 3600 4000 4586 5000 15900
## --------------------------------------------------------
## ld_prosper_rationg_without_unknown$ProsperRating..Alpha.: HR
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 3000 4000 3463 4000 16800
People with a better Prosper rating borrowed more money. It is logical: someone who has more money can borrow more, because they can pay back more every month.
There is no big difference between AA, A, B and C in Q1, median and Q3. The difference shows in the values outside of this area: A and B have significantly more outliers than the other ratings.
D is like a bridge between the AA-C and E-HR ratings. D’s Q3 is around C’s median, and its Q1 is around E’s median. E and HR borrowed way less money than the other ratings.
# Borrowed loan & Categories
ggplot(subset(ld, ld$ListingCategory..numeric. != 'Not Available'),
aes(reorder(ListingCategory..numeric., -LoanOriginalAmount, median),
LoanOriginalAmount)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) +
ylab("Amount of Loan") +
xlab("Category")
As we can see, the outliers are always above Q3. I have sorted the boxplots by their medians. From Debt Consolidation to Taxes there is no huge difference between the Q1s. The first categories have a wider range of borrowed amounts.
Baby&Adoption takes second place, but previously we saw that there are only a few people in that category.
# Interest Rate & Homeowner
ggplot(ld, (aes(ld$IsBorrowerHomeowner, ld$BorrowerRate))) +
geom_boxplot() +
ylab("Interest Rate") +
xlab("Is Borrower Homeowner")
## ld$IsBorrowerHomeowner: NotHomeowner
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1449 0.1980 0.2029 0.2624 0.4975
## --------------------------------------------------------
## ld$IsBorrowerHomeowner: Homeowner
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1239 0.1700 0.1828 0.2394 0.3600
As seen, homeowners have lower interest rates. The homeowner median is 0.17, while the non-homeowner median is almost 0.2.
# Times of Delinquencies & Amount of Loan
ggplot(ld,aes(x =ld$LoanOriginalAmount, y = ld$DelinquenciesLast7Years)) +
geom_point() +
ylab("Delinquencies in the Last 7 Years") +
xlab("Amount of Loan")
Too many points on the plot. Let’s change the alpha level.
ggplot(ld,aes(x =ld$LoanOriginalAmount, y = ld$DelinquenciesLast7Years)) +
geom_point(alpha = 1/35) +
ylab("Delinquencies in the Last 7 Years") +
xlab("Amount of Loan")
Much better. But the points are too small. Let’s limit the x axis and add a little horizontal jitter.
ggplot(ld,aes(x =ld$LoanOriginalAmount, y = ld$DelinquenciesLast7Years)) +
geom_point(alpha = 1/35, position = position_jitter(width = 100)) +
xlim(c(0,25000)) +
ylab("Delinquencies in the Last 7 Years") +
xlab("Amount of Loan")
The points pile up at around 2,000, 3,000, 4,000, 5,000, 10,000, et cetera, likely because more people borrowed exactly those amounts. I’ll add those counts to the plot later to see the connection.
I have found a relationship between interest rate and Prosper rating. People with a higher Prosper rating get lower interest rates.
There is also a connection between the amount of the loan and the monthly payment, with a correlation coefficient of 0.93. It seems like there is a connection with the term variable as well.
If the Prosper rating is “HR”, we can tell with high confidence what the interest rate will be.
No, it seems like the variables are connected with the featured ones.
The amount of the loan and the monthly payment. That relationship is really strong.
# Amount and delinquencies plot with borrowed amount counts
ggplot(ld, aes(ld$LoanOriginalAmount, ld$DelinquenciesLast7Years)) +
geom_point(alpha = 1/35, color = 'red',
position = position_jitter(width = 100)) +
geom_line(aes(ld$LoanOriginalAmount, ..count../150),
color = 'blue', stat = 'bin', binwidth = 100, alpha = 0.75) +
scale_y_continuous(sec.axis = sec_axis(~. * 150 ,
name = 'Borrowed amount count')) +
theme(axis.text.y = element_text(color = 'red')) +
theme(axis.text.y.right = element_text(color = 'blue')) +
theme(axis.title.y = element_text(color = 'red')) +
theme(axis.title.y.right = element_text(color = 'blue')) +
ylab("Delinquencies in the Last 7 Years") +
xlab("Amount of Loan")
As I thought, the “Borrowed amount count” line fits the point pattern between the loan amount and the delinquencies in the last 7 years almost perfectly. The dense columns of points simply reflect how many loans exist at each amount.
In the second plot of each pair I set the alpha level to 1/25.
# Interest Rate and Borrowed Money & Term length
ggplot(ld, aes(ld$BorrowerRate, ld$LoanOriginalAmount,
color = factor(ld$Term))) +
geom_point(size = 3) +
xlim(c(0.05,0.35)) +
ylim(c(0,25000)) +
guides(col = guide_legend(override.aes =
list(shape = 15, size = 10, alpha = 1))) +
scale_color_manual(values = term_colors) +
ylab("Amount of Loan") +
xlab("Interest Rate") +
labs(color = "Term")
# Interest Rate and Borrowed Money & Term length
ggplot(ld, aes(ld$BorrowerRate, ld$LoanOriginalAmount,
color = factor(ld$Term))) +
geom_point(size = 4, alpha = 1/25) +
xlim(c(0.05,0.35)) +
ylim(c(0,25000)) +
guides(col = guide_legend(override.aes =
list(shape = 15, size = 10, alpha = 1))) +
scale_color_manual(values = term_colors) +
ylab("Amount of Loan") +
xlab("Interest Rate") +
labs(color = "Term")
Borrowers with a longer payment term and a bigger loan often have the same interest rates as shorter-term borrowers with smaller loans.
# Interest Rate and Borrowed Money & prosper rating
ggplot(ld, aes(ld$BorrowerRate, ld$LoanOriginalAmount,
color = ld$ProsperRating..Alpha.)) +
geom_point( size = 4) +
xlim(c(0.05,0.35)) +
ylim(c(0,25000)) +
guides(col = guide_legend(override.aes = list(shape = 15,
size = 10, alpha = 1))) +
scale_color_manual(values = prosper_rating_colors ) +
ylab("Amount of Loan") +
xlab("Interest Rate") +
labs(color = "Prosper Rating")
# Interest Rate and Borrowed Money & prosper rating
ggplot(ld, aes(ld$BorrowerRate, ld$LoanOriginalAmount,
color = ld$ProsperRating..Alpha.)) +
geom_point( size = 4, alpha = 1/25) +
xlim(c(0.05,0.35)) +
ylim(c(0,25000)) +
guides(col = guide_legend(override.aes = list(shape = 15,
size = 10, alpha = 1))) +
scale_color_manual(values = prosper_rating_colors ) +
ylab("Amount of Loan") +
xlab("Interest Rate") +
labs(color = "Prosper Rating")
Interesting. At the better Prosper ratings it usually does not matter much how much money you borrow, because the rating is the key when the interest rate is calculated. At lower Prosper ratings the amount counts more, until the HR category.
# Interest Rate and Borrowed Money & Is Homeowner
ggplot(ld, aes(ld$BorrowerRate, ld$LoanOriginalAmount,
color = ld$IsBorrowerHomeowner)) +
geom_point(size = 4) +
xlim(c(0.05,0.35)) +
ylim(c(0,25000)) +
guides(col = guide_legend(override.aes = list(shape = 15,
size = 10, alpha = 1))) +
scale_color_manual(values = is_homeowner_colors) +
ylab("Amount of Loan") +
xlab("Interest Rate") +
labs(color = "Is Borrower Homeowner")
# Interest Rate and Borrowed Money & Is Homeowner
ggplot(ld, aes(ld$BorrowerRate, ld$LoanOriginalAmount,
color = ld$IsBorrowerHomeowner)) +
geom_point(size = 4, alpha = 1/25) +
xlim(c(0.05,0.35)) +
ylim(c(0,25000)) +
guides(col = guide_legend(override.aes = list(shape = 15,
size = 10, alpha = 1))) +
scale_color_manual(values = is_homeowner_colors) +
ylab("Amount of Loan") +
xlab("Interest Rate") +
labs(color = "Is Borrower Homeowner")
Homeowners cluster in the upper left corner (lower interest rate, bigger loan), while the rest of the borrowers drift to the lower right corner (higher interest rate, smaller loan).
In the second plot of each pair I set the alpha level to 1/75.
#Amount of Loan & Monthly Loan Payment & Term
ggplot(ld, aes(ld$LoanOriginalAmount, ld$MonthlyLoanPayment,
color = factor(ld$Term))) +
geom_point(size = 4) +
scale_color_manual(values = term_colors) +
ylab("Monthly Loan Payment") +
xlab("Amount of Loan") +
guides(col = guide_legend(override.aes = list(shape = 15,
size = 10, alpha = 1))) +
labs(color = "Term")
#Amount of Loan & Monthly Loan Payment & Term
ggplot(ld, aes(ld$LoanOriginalAmount, ld$MonthlyLoanPayment,
color = factor(ld$Term))) +
geom_point(size = 4, alpha = 1/75) +
scale_color_manual(values = term_colors) +
ylab("Monthly Loan Payment") +
xlab("Amount of Loan") +
guides(col = guide_legend(override.aes = list(shape = 15,
size = 10, alpha = 1))) +
labs(color = "Term")
We know that there are terms of 12, 36 and 60 months. It is no surprise that those who borrowed for a longer period paid less every month. But there are some interesting points. How can two 1-year points sit at the bottom of the y axis at around 12,000?
#Amount of Loan & Monthly Loan Payment & Prosper Rating
ggplot(ld, aes(ld$LoanOriginalAmount, ld$MonthlyLoanPayment,
color = ld$ProsperRating..Alpha.)) +
geom_point(size = 4) +
scale_color_manual(values = prosper_rating_colors ) +
ylab("Monthly Loan Payment") +
xlab("Amount of Loan") +
guides(col = guide_legend(override.aes = list(shape = 15,
size = 10, alpha = 1))) +
labs(color = "Prosper Rating")
#Amount of Loan & Monthly Loan Payment & Prosper Rating
ggplot(ld, aes(ld$LoanOriginalAmount, ld$MonthlyLoanPayment,
color = ld$ProsperRating..Alpha.)) +
geom_point(size = 4, alpha = 1/75) +
scale_color_manual(values = prosper_rating_colors ) +
ylab("Monthly Loan Payment") +
xlab("Amount of Loan") +
guides(col = guide_legend(override.aes = list(shape = 15,
size = 10, alpha = 1))) +
labs(color = "Prosper Rating")
We still don’t know the answer. We can see that the two points are rated “A” and “B”, but without knowing more about the points next to them (the red ones) it is hard to tell. Maybe borrowers with good Prosper ratings get a special offer, like paying back within a year with a low monthly payment, and once they have the money they can pay back the whole loan at once.
From this plot it can be seen that most of the missing data comes from the 3-year term.
#Amount of Loan & Monthly Loan Payment & IsHomeowner
ggplot(ld, aes(ld$LoanOriginalAmount, ld$MonthlyLoanPayment,
color = ld$IsBorrowerHomeowner)) +
geom_point(size = 3) +
scale_color_manual(values = is_homeowner_colors) +
ylab("Monthly Loan Payment") +
xlab("Amount of Loan") +
labs(color = "Is Borrower Homeowner")
#Amount of Loan & Monthly Loan Payment & IsHomeowner
ggplot(ld, aes(ld$LoanOriginalAmount, ld$MonthlyLoanPayment,
color = ld$IsBorrowerHomeowner)) +
geom_point(size = 3, alpha = 1/75) +
scale_color_manual(values = is_homeowner_colors) +
ylab("Monthly Loan Payment") +
xlab("Amount of Loan") +
guides(col = guide_legend(override.aes = list(shape = 15,
size = 10, alpha = 1))) +
labs(color = "Is Borrower Homeowner")
Homeowners are more likely to go for a 3- or 5-year payment term. People who don’t own a home borrow less money than homeowners.
In the second plot of each pair I set the alpha level to 1/50.
# Monthly Income & Monthly Loan Payment & Prosper Rating
ggplot(ld, aes(ld$StatedMonthlyIncome, ld$MonthlyLoanPayment,
color = ld$ProsperRating..Alpha.)) +
geom_point(size = 3) +
xlim(c(0,50000)) +
scale_colour_manual(values = prosper_rating_colors) +
ylab("Monthly Loan Payment") +
xlab("Monthly Income") +
labs(color = "Prosper Rating")
# Monthly Income & Monthly Loan Payment & Prosper Rating
ggplot(ld, aes(ld$StatedMonthlyIncome, ld$MonthlyLoanPayment,
color = ld$ProsperRating..Alpha.)) +
geom_point(size = 3, alpha = 1/50) +
xlim(c(0,50000)) +
scale_colour_manual(values = prosper_rating_colors) +
ylab("Monthly Loan Payment") +
xlab("Monthly Income") +
guides(col = guide_legend(override.aes = list(shape = 15,
size = 10, alpha = 1))) +
labs(color = "Prosper Rating")
# Monthly Income & Monthly Loan Payment & Prosper Rating
ggplot(ld, aes(ld$StatedMonthlyIncome, ld$MonthlyLoanPayment,
color = ld$ProsperRating..Alpha.)) +
geom_point(size = 3, alpha = 1/50) +
xlim(c(0,20000)) +
ylim(0,1500) +
scale_colour_manual(values = prosper_rating_colors) +
ylab("Monthly Loan Payment") +
xlab("Monthly Income") +
guides(col = guide_legend(override.aes = list(shape = 15,
size = 10, alpha = 1))) +
labs(color = "Prosper Rating")
Another interesting plot. The best-rated borrowers sit in the middle of the group. In the “HR” category almost everyone has about the same monthly loan payment. The “A” and “B” categories cover wider and higher ranges than the others.
# Monthly Income & Monthly Loan Payment & Is Borrower Homeowner
ggplot(ld, aes(ld$StatedMonthlyIncome, ld$MonthlyLoanPayment,
color = ld$IsBorrowerHomeowner)) +
geom_point(size = 3) +
xlim(c(0,50000)) +
scale_colour_manual(values = is_homeowner_colors) +
ylab("Monthly Loan Payment") +
xlab("Monthly Income") +
labs(color = "Is Borrower Homeowner")
# Monthly Income & Monthly Loan Payment & Is Borrower Homeowner
ggplot(ld, aes(ld$StatedMonthlyIncome, ld$MonthlyLoanPayment,
color = ld$IsBorrowerHomeowner)) +
geom_point(size = 3, alpha = 1/50) +
xlim(c(0,50000)) +
scale_colour_manual(values = is_homeowner_colors) +
ylab("Monthly Loan Payment") +
xlab("Monthly Income") +
guides(col = guide_legend(override.aes = list(shape = 15,
size = 10, alpha = 1))) +
labs(color = "Is Borrower Homeowner")
# Monthly Income & Monthly Loan Payment & Is Borrower Homeowner
ggplot(ld, aes(ld$StatedMonthlyIncome, ld$MonthlyLoanPayment,
color = ld$IsBorrowerHomeowner)) +
geom_point(size = 3, alpha = 1/50) +
xlim(c(0,20000)) +
ylim(0,1500) +
scale_colour_manual(values = is_homeowner_colors) +
ylab("Monthly Loan Payment") +
xlab("Monthly Income") +
guides(col = guide_legend(override.aes = list(shape = 15,
size = 10, alpha = 1))) +
labs(color = "Is Borrower Homeowner")
Someone who does not own a home is more likely to take a smaller loan with a lower monthly payment. Only a few homeowners are interested in such loans.
# Monthly Income & Monthly Loan Payment & Term
ggplot(ld, aes(ld$StatedMonthlyIncome, ld$MonthlyLoanPayment,
color = factor(ld$Term))) +
geom_point(size = 3) +
xlim(c(0,50000)) +
scale_color_manual(values = term_colors) +
ylab("Monthly Loan Payment") +
xlab("Monthly Income") +
labs(color = "Term")
# Monthly Income & Monthly Loan Payment & Term
ggplot(ld, aes(ld$StatedMonthlyIncome, ld$MonthlyLoanPayment,
color = factor(ld$Term))) +
geom_point(size = 3, alpha = 1/50) +
xlim(c(0,50000)) +
scale_color_manual(values = term_colors) +
ylab("Monthly Loan Payment") +
xlab("Monthly Income") +
guides(col = guide_legend(override.aes = list(shape = 15,
size = 10, alpha = 1))) +
labs(color = "Term")
# Monthly Income & Monthly Loan Payment & Term
ggplot(ld, aes(ld$StatedMonthlyIncome, ld$MonthlyLoanPayment,
color = factor(ld$Term))) +
geom_point(size = 3, alpha = 1/50) +
xlim(c(0,20000)) +
ylim(0,1500) +
scale_color_manual(values = term_colors) +
ylab("Monthly Loan Payment") +
xlab("Monthly Income") +
guides(col = guide_legend(override.aes = list(shape = 15,
size = 10, alpha = 1))) +
labs(color = "Term")
The 1-year term dominates the upper part of the plot: these borrowers accept a higher monthly loan payment, but they are not likely to take out a big loan. The 3-year term spreads across the middle; these borrowers pay neither too much nor too little per month. The 5-year term appears almost everywhere except the upper part of the plot, where there are only 1-year terms.
The “Borrowed Amount Count” fits onto the relationship between the loan amount and delinquencies in the last 7 years.
We can see the layers in the relationship between interest rate and loan amount once we add the Prosper rating categories. The rating clearly splits the dataset by category.
There is a really strong relationship between the monthly payment, amount of loan and the term. There are 3 trend lines separated by the term variable.
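One way to see why three separate trend lines appear is the standard annuity formula for a fixed-rate loan. The sketch below is only an illustration: it assumes the loans are fully amortizing with a fixed nominal annual rate, and the function name `monthly_payment`, the 10,000 principal and the 20% rate are hypothetical values, not taken from the dataset.

```r
# Monthly payment of a fully amortizing fixed-rate loan (standard annuity formula).
# Assumption: annual_rate is a nominal yearly rate (0.20 = 20%),
# term_months is the loan term in months (12, 36 or 60).
monthly_payment <- function(principal, annual_rate, term_months) {
  r <- annual_rate / 12                       # monthly rate
  if (r == 0) return(principal / term_months) # no interest: equal instalments
  principal * r / (1 - (1 + r)^(-term_months))
}

terms <- c(12, 36, 60)

# Hypothetical loan: same amount and rate, three different terms
payments <- sapply(terms, function(n) monthly_payment(10000, 0.20, n))
names(payments) <- paste(terms, "months")
round(payments, 2)

# For a fixed rate and term the payment is proportional to the amount,
# so each term traces its own straight line through the origin.
monthly_payment(20000, 0.20, 36) / monthly_payment(10000, 0.20, 36)
```

Under these assumptions the shortest term gives the steepest line. Because the interest rate differs from loan to loan, the real data forms bands around each line rather than perfectly straight lines.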
This is the plot showing the connections between these 3 variables: Monthly Income, Monthly Loan Payment, and Prosper Rating. I would not have expected the way the categorical variable is distributed on the chart. It surprised me.
This is definitely one of the 3 plots that I want to summarize at the end, and here is why. I thought a lot about this plot’s structure. How should I find the sequences? I put a lot of work into this. I found that from 6 days to 336 days, in steps of 30, we can see the changes: fewer people have delinquencies at those points than at the neighboring ones. It was hard to find.
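The sequence of days described here can be written down with `seq()`. This is a hypothetical reconstruction, assuming the sequence runs from 6 to 336 in steps of 30; the variable name `delinquency_days` is made up for illustration.

```r
# Hypothetical reconstruction of the day sequence: 6, 36, 66, ..., 336
delinquency_days <- seq(from = 6, to = 336, by = 30)
delinquency_days
# 6 36 66 96 126 156 186 216 246 276 306 336
length(delinquency_days)
# 12
```

A vector like this could then be passed to `geom_vline(xintercept = delinquency_days)` to mark the dips on the plot.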
Here we can see how the Prosper rating splits the plot vertically. The worse the rating, the higher the interest rate. The borrowed amount does not really influence this, except in the HR category.
This is a good descriptive plot. When you look at it, you immediately know what is happening. You see the 3 trend lines associated with the term. It says it all. The shortest term has the highest monthly payment, as we expected. Nothing strange.
I have investigated the relationships between many variables, but they are still only a small part of the 81. I think these are the most important variables, though. More can of course be investigated in the future, but this already gives us a lot of information.
It was hard to find the sequences in the delinquencies plot. I thought about it a lot and experimented with the colors, the plots themselves, et cetera. The result is good to see now. There were also easier plots, such as Prosper ratings against interest rates. It is no surprise that a borrower in a badly rated group gets a higher interest rate.
Some information would have been good to know beforehand, because we can’t figure out everything from the dataset alone. We can only guess.
It would be interesting to explore some of the date-type variables in the future.